Our team, Miranda Diaz and Raiya Dhawala, will be investigating data sets related to rat sightings in New York City: how sightings have been affected by COVID-19, and how their frequency varies by borough.
The data set we are working with is Rat Sightings. It contains information on rat sightings in New York City from 2010 to the present day, based on 311 service requests made by the community. Each row is a different rat sighting with a unique key, the identifier of the service request, and in theory a different rat, although there is no way to be sure. Some columns describe the address, date, streets, landmark, and borough where each rat was seen. There is also a column showing when the service request was created, which we used to divide rat sightings by time of day, month, and year. We were interested in this data set because it can help answer questions such as, “How have rat sightings increased or decreased in different boroughs since COVID-19?” We brought in further data to assess COVID-19 case counts in different New York City neighborhoods, and NYC population data to analyze rat sightings per 1,000 people in each borough. Further data may look at the population of different neighborhoods, and whether certain neighborhoods make more 311 requests than others.
Our collaboration plan is to meet on Thursdays around 4:30 to work on the final project. We will communicate through text messages to keep each other updated and organized. We will draft in Google Docs and then transfer our work into Google Colab.
We loaded our dataset into the dataset folder. The dataset originally has 38 columns, showing things such as the date the request was created, the agency name, the location (including neighborhood and address), and the status of the complaint. We tidied the data by dropping repetitive location columns such as Cross Street 1 and 2, keeping only the incident address and neighborhood. We then created Month, Day, and Year columns and set the formatted date as the index.
#import pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
#mount drive
from google.colab import drive
drive.mount('/content/drive')
#read in rat csv; low_memory=False avoids a mixed-type DtypeWarning on some columns
rat_df = pd.read_csv('/content/drive/My Drive/Rat_Sightings.csv', low_memory=False)
#display it
rat_df.head()
| Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15636547 | 01/02/2010 09:38:29 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Rat Sighting | Other (Explain Below) | 11432.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.703342 | -73.800203 | (40.70334174980328, -73.80020318978804) |
| 1 | 15636689 | 01/02/2010 07:09:56 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Rat Sighting | Catch Basin/Sewer | 11204.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.608288 | -73.977775 | (40.60828832742201, -73.97777515791093) |
| 2 | 15636710 | 01/02/2010 09:04:46 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Rat Sighting | 3+ Family Mixed Use Building | 11375.0 | 68-12 YELLOWSTON BOULEVARD | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.726060 | -73.848329 | (40.726059538480236, -73.84832900182792) |
| 3 | 15636731 | 01/02/2010 06:27:59 PM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Rat Sighting | 3+ Family Mixed Use Building | 11233.0 | 1040 HERKIMER STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.677957 | -73.918921 | (40.67795748580213, -73.91892129635086) |
| 4 | 15636907 | 01/02/2010 12:50:16 PM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Rat Sighting | 3+ Family Apt. Building | 10034.0 | 241 SHERMAN AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.866008 | -73.919401 | (40.86600810878767, -73.9194006525354) |
5 rows × 38 columns
We made the formatted date the index of the rat dataframe so that it was easier to manipulate with other datasets such as the COVID data.
#drop unnecessary columns
rat_df.drop(columns = ['Closed Date', 'Agency', 'Cross Street 1', 'Cross Street 2', 'Intersection Street 1', 'Intersection Street 2', 'Address Type', 'Landmark', 'Status', 'Due Date', 'Resolution Action Updated Date', 'Street Name', 'City', 'Facility Type', 'X Coordinate (State Plane)', 'Y Coordinate (State Plane)', 'Vehicle Type', 'Taxi Company Borough', 'Taxi Pick Up Location', 'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment', 'Location'], inplace=True)
#use to_datetime function to change Created Date to Datetime format so we can extract month and year
rat_df['Created Date'] = pd.to_datetime(rat_df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')
#Extract month and year into new columns
rat_df['Month'] = rat_df['Created Date'].dt.month
rat_df['Year'] = rat_df['Created Date'].dt.year
rat_df['Day'] = rat_df['Created Date'].dt.day
#strip the time component so each sighting maps to a calendar date
rat_df['formatted_date'] = rat_df['Created Date'].dt.normalize()
#set formatted_date (the date without the time) as the index
rat_df.set_index("formatted_date", inplace=True)
rat_df.head()
| Unique Key | Created Date | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Community Board | Borough | Park Facility Name | Park Borough | Latitude | Longitude | Month | Year | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| formatted_date | |||||||||||||||||
| 2010-01-02 | 15636547 | 2010-01-02 09:38:29 | Department of Health and Mental Hygiene | Rodent | Rat Sighting | Other (Explain Below) | 11432.0 | NaN | 12 QUEENS | QUEENS | Unspecified | QUEENS | 40.703342 | -73.800203 | 1 | 2010 | 2 |
| 2010-01-02 | 15636689 | 2010-01-02 07:09:56 | Department of Health and Mental Hygiene | Rodent | Rat Sighting | Catch Basin/Sewer | 11204.0 | NaN | 11 BROOKLYN | BROOKLYN | Unspecified | BROOKLYN | 40.608288 | -73.977775 | 1 | 2010 | 2 |
| 2010-01-02 | 15636710 | 2010-01-02 09:04:46 | Department of Health and Mental Hygiene | Rodent | Rat Sighting | 3+ Family Mixed Use Building | 11375.0 | 68-12 YELLOWSTON BOULEVARD | 06 QUEENS | QUEENS | Unspecified | QUEENS | 40.726060 | -73.848329 | 1 | 2010 | 2 |
| 2010-01-02 | 15636731 | 2010-01-02 18:27:59 | Department of Health and Mental Hygiene | Rodent | Rat Sighting | 3+ Family Mixed Use Building | 11233.0 | 1040 HERKIMER STREET | 03 BROOKLYN | BROOKLYN | Unspecified | BROOKLYN | 40.677957 | -73.918921 | 1 | 2010 | 2 |
| 2010-01-02 | 15636907 | 2010-01-02 12:50:16 | Department of Health and Mental Hygiene | Rodent | Rat Sighting | 3+ Family Apt. Building | 10034.0 | 241 SHERMAN AVENUE | 12 MANHATTAN | MANHATTAN | Unspecified | MANHATTAN | 40.866008 | -73.919401 | 1 | 2010 | 2 |
#display datatypes
display(rat_df.dtypes)
```
Unique Key                     int64
Created Date          datetime64[ns]
Agency Name                   object
Complaint Type                object
Descriptor                    object
Location Type                 object
Incident Zip                 float64
Incident Address              object
Community Board               object
Borough                       object
Park Facility Name            object
Park Borough                  object
Latitude                     float64
Longitude                    float64
Month                          int64
Year                           int64
Day                            int64
dtype: object
```
We created a table of rat sighting counts per day so that we can compare daily rat sightings with daily COVID case counts.
#Make dataframe which has amount of rats spotted per day
rat_per_day_df = rat_df.groupby('formatted_date').size().reset_index(name='rat_sightings_count')
rat_per_day_df.head()
| formatted_date | rat_sightings_count | |
|---|---|---|
| 0 | 2010-01-01 | 9 |
| 1 | 2010-01-02 | 12 |
| 2 | 2010-01-03 | 3 |
| 3 | 2010-01-04 | 24 |
| 4 | 2010-01-05 | 14 |
Below we see the total number of rat sightings reported in each borough, with Brooklyn having the highest count. But in order to get a better perspective, why don't we look at this on a map?
#show frequency of rat sightings by borough
rat_df['Borough'].value_counts()
```
BROOKLYN         84208
MANHATTAN        61172
BRONX            42182
QUEENS           34808
STATEN ISLAND     9250
Unspecified         25
Name: Borough, dtype: int64
```
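As an aside, rather than typing the counts out by hand (as we do in the next cell), the borough table could be derived directly from the `value_counts` output. The sketch below uses the counts printed above, with `str.title()` so the borough names match the title-cased population data brought in later:

```python
import pandas as pd

# Sketch: build the borough table from the value_counts output above
# instead of hardcoding a fresh DataFrame.
counts = pd.Series({'BROOKLYN': 84208, 'MANHATTAN': 61172, 'BRONX': 42182,
                    'QUEENS': 34808, 'STATEN ISLAND': 9250, 'Unspecified': 25},
                   name='Rat Sightings')
borough_df = counts.rename_axis('Borough').reset_index()
# Title-case borough names so they line up with the population data used later
borough_df['Borough'] = borough_df['Borough'].str.title()
print(borough_df)
```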
#new dataframe showing boroughs (counts hardcoded from an earlier snapshot of the data, so they differ slightly from the value_counts above)
borough_df = pd.DataFrame({'Borough':['Brooklyn','Manhattan','Bronx', 'Queens', 'Staten Island', 'Unspecified'], 'Rat Sightings': [83461, 60699, 41917, 34507, 9213, 33]})
borough_df.head(5)
| Borough | Rat Sightings | |
|---|---|---|
| 0 | Brooklyn | 83461 |
| 1 | Manhattan | 60699 |
| 2 | Bronx | 41917 |
| 3 | Queens | 34507 |
| 4 | Staten Island | 9213 |
Now the figure below gives us a better visual of just how many rats have actually been reported since 2010. Each color represents a different borough and each dot represents a different report. But where exactly have the most rats been reported in each borough? A better question is, where should you avoid hanging out in NYC?
import plotly.express as px
#don't include rows with no latitude or longitude
rat_df = rat_df.dropna(subset=['Latitude', 'Longitude'])
#don't include unspecified
filtered_rat_df = rat_df[rat_df['Borough'] != 'Unspecified']
#plotting the map with different colors for each borough
fig = px.scatter_mapbox(filtered_rat_df, lat='Latitude', lon='Longitude', color='Borough', mapbox_style='open-street-map')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
(Image of figure in case it can't be seen when file is converted to HTML)
Now, what are the spots with the most frequent rat sightings in each borough? The figure below uses the latitude and longitude of where the rat was sighted to determine just that.
#remove rows with missing latitude or longitude values
rat_df = rat_df.dropna(subset=['Latitude', 'Longitude'])
#do not include reports that say unspecfied borough
filtered_rat_df = rat_df[rat_df['Borough'] != 'Unspecified']
#group by borough and coordinate pair, then keep each borough's most frequently reported spot
sighting_counts = filtered_rat_df.groupby(['Borough', 'Latitude', 'Longitude']).size().reset_index(name='count')
most_sightings = sighting_counts.loc[sighting_counts.groupby('Borough')['count'].idxmax()].reset_index(drop=True)
#make map showing which latitude and longitude had the most rat sightings for each borough
most_sightings['MarkerSize'] = 10
fig = px.scatter_mapbox(most_sightings,lat='Latitude',lon='Longitude', hover_name='Borough',zoom=10, color='Borough', mapbox_style='open-street-map', size='MarkerSize')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
(Image of figure in case it cannot be seen when turned into HTML)
Over time, rat sightings have increased, as seen in the plot below. This could be due to multiple factors, including population growth and how inclined people are to call 311 to report a rat. There is a notable dip in 2020 and 2021, probably due to COVID-19 and the fact that people were not going outside as much. In 2023, we have seen the most rat sightings reported to date!
#display rat sighting frequency per year, with one bin per year
rat_df['Year'].plot.hist(bins=rat_df['Year'].nunique())
But what good is knowing the number of rats if it's not comparable to the size or population of each borough? To do that, we need to bring in population data for each borough and calculate the number of rat sightings per 1,000 people.
#read in population csv (drive is already mounted above)
population_df = pd.read_csv('/content/drive/My Drive/new_pop_again.csv')
population_df = population_df.drop([0]).reset_index(drop=True)
population_df['Population'] = population_df['Population'].str.replace(',', '').astype(int)
population_df.head(10)
| Index | GeoType | Borough | GeoID | Name | Population | |
|---|---|---|---|---|---|---|
| 0 | 2 | Boro | Manhattan | 1 | Manhattan | 1694251 |
| 1 | 3 | Boro | Bronx | 2 | Bronx | 1472654 |
| 2 | 4 | Boro | Brooklyn | 3 | Brooklyn | 2736074 |
| 3 | 5 | Boro | Queens | 4 | Queens | 2405464 |
| 4 | 6 | Boro | Staten Island | 5 | Staten Island | 495747 |
#combine borough and population then calculate per capita
result_df = pd.merge(borough_df, population_df, on='Borough')
#the Unspecified row is dropped by the merge; its count is minimal anyway
#delete unnecessary columns
result_df= result_df.drop(columns=['Index', 'GeoID', 'GeoType', 'Name'])
#rename rat sightings
result_df = result_df.rename(columns={'Rat Sightings': 'Rat_Sightings'})
result_df.head(7)
| Borough | Rat_Sightings | Population | |
|---|---|---|---|
| 0 | Brooklyn | 83461 | 2736074 |
| 1 | Manhattan | 60699 | 1694251 |
| 2 | Bronx | 41917 | 1472654 |
| 3 | Queens | 34507 | 2405464 |
| 4 | Staten Island | 9213 | 495747 |
#per capita rat sightings
result_df['per_capita_1000'] = (result_df['Rat_Sightings']/result_df['Population'])*1000
result_df.head(8)
| Borough | Rat_Sightings | Population | per_capita_1000 | |
|---|---|---|---|---|
| 0 | Brooklyn | 83461 | 2736074 | 30.503926 |
| 1 | Manhattan | 60699 | 1694251 | 35.826451 |
| 2 | Bronx | 41917 | 1472654 | 28.463577 |
| 3 | Queens | 34507 | 2405464 | 14.345257 |
| 4 | Staten Island | 9213 | 495747 | 18.584076 |
Using this newfound information, we can plot to analyze.
#result_df = result_df.set_index('Borough')
#plot
result_df.Rat_Sightings.plot.bar()
#label
plt.xlabel('Borough')
plt.ylabel('Rat Sightings')
plt.title('Rat Sightings by Borough')
plt.show()
#other graph for per capita rat sightings
result_df.per_capita_1000.plot.bar()
plt.xlabel('Borough')
plt.ylabel('Per 1000 people Rat Sightings')
plt.title('Per 1000 capita Rat Sightings by Borough')
plt.show()
So, although Brooklyn has the largest human population and the most total sightings, Manhattan has the most rat sightings per 1,000 residents!
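The per-1,000 arithmetic behind that conclusion can be double-checked by hand with the figures from the tables above:

```python
# Sanity check of the per-1,000 calculation, using the sighting and
# population figures from the tables above.
sightings = {'Brooklyn': 83461, 'Manhattan': 60699}
population = {'Brooklyn': 2736074, 'Manhattan': 1694251}
per_1000 = {b: sightings[b] / population[b] * 1000 for b in sightings}
# Manhattan's rate tops Brooklyn's despite having fewer total sightings
print(per_1000)
```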
The next point of interest we will be tackling is how rat sightings have changed before and after the pandemic. To analyze that, we will be bringing in COVID-19 data that displays the case count by date and by borough in New York City.
#read in covid csv and keep only the date and case-count columns
COVID_df = pd.read_csv('/content/drive/My Drive/COVID_Daily_Counts.csv')
columns = ['date_of_interest', 'CASE_COUNT', 'BX_CASE_COUNT', 'BK_CASE_COUNT', 'MN_CASE_COUNT', 'QN_CASE_COUNT', 'SI_CASE_COUNT']
#copy() avoids a SettingWithCopyWarning when we modify the slice below
new_COVID_df = COVID_df[columns].copy()
new_COVID_df['date_of_interest'] = pd.to_datetime(new_COVID_df['date_of_interest'], format='%m/%d/%Y')
new_COVID_df.head()
| date_of_interest | CASE_COUNT | BX_CASE_COUNT | BK_CASE_COUNT | MN_CASE_COUNT | QN_CASE_COUNT | SI_CASE_COUNT | |
|---|---|---|---|---|---|---|---|
| 0 | 02/29/2020 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 03/01/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 03/02/2020 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 03/03/2020 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 03/04/2020 | 5 | 0 | 1 | 2 | 2 | 0 |
#merge covid cases and rat df
new_COVID_df['date_of_interest'] = pd.to_datetime(new_COVID_df['date_of_interest'], format='%m/%d/%Y')
rat_COVID_df = pd.merge(rat_per_day_df, new_COVID_df, left_on='formatted_date', right_on='date_of_interest', how='left')
# Drop the duplicate 'formatted_date' column from the COVID DataFrame
rat_COVID_df = rat_COVID_df.drop('date_of_interest', axis=1)
# Display the merged DataFrame
rat_COVID_df.head()
| formatted_date | rat_sightings_count | CASE_COUNT | BX_CASE_COUNT | BK_CASE_COUNT | MN_CASE_COUNT | QN_CASE_COUNT | SI_CASE_COUNT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2010-01-01 | 9 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2010-01-02 | 12 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2010-01-03 | 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2010-01-04 | 24 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2010-01-05 | 14 | NaN | NaN | NaN | NaN | NaN | NaN |
The following plot compares daily rat sightings with daily COVID case counts from September 1st, 2019 to August 30th, 2023, showing what rat sightings looked like before and after COVID. Sightings have increased since COVID, as they were lower on average in 2019. As expected, when COVID cases spiked, rat sightings dipped, which can probably be attributed to people not going outside as much when there were more cases.
date_column = 'formatted_date'
#filtering data by date
start_date = '2019-09-01'
end_date = '2023-08-30'
rat_COVID_zoomed = rat_COVID_df[(rat_COVID_df[date_column] >= start_date) & (rat_COVID_df[date_column] <= end_date)]
plt.figure(figsize=(18, 12))
ax1 = plt.gca()
#COVID Cases on the left y axis
ax1.plot(rat_COVID_zoomed['formatted_date'], rat_COVID_zoomed['CASE_COUNT'], label='COVID Cases', color='blue')
ax1.set_xlabel('Date')
ax1.set_ylabel('COVID Cases')
#second y axis for rat sightings so they are readable at their own scale
ax2 = ax1.twinx()
ax2.plot(rat_COVID_zoomed['formatted_date'], rat_COVID_zoomed['rat_sightings_count'], label='Rat Sightings per Day', color='red')
ax2.set_ylabel('Rat Sightings per Day')
#title, combined legend from both axes, and grid
ax1.set_title('COVID Cases and Rat Sightings per Day From 2019-2023')
lines = ax1.get_lines() + ax2.get_lines()
ax1.legend(lines, [line.get_label() for line in lines])
ax1.grid(True)
plt.show()
The following plot shows the number of COVID-19 cases over time, per borough. There was a high spike in cases around January 2022. We will use the borough-level COVID data and compare it to the borough rat sightings we analyzed above in the future.
date_column = 'date_of_interest'
#copy() avoids a SettingWithCopyWarning when reassigning the date column
new_COVID_df = new_COVID_df.copy()
new_COVID_df[date_column] = pd.to_datetime(COVID_df[date_column])
#filtering data by date
start_date = '2019-09-01'
end_date = '2023-08-30'
filtered_df = new_COVID_df[(new_COVID_df[date_column] >= start_date) & (new_COVID_df[date_column] <= end_date)]
boroughs = ['BX_CASE_COUNT', 'BK_CASE_COUNT', 'MN_CASE_COUNT', 'QN_CASE_COUNT', 'SI_CASE_COUNT']
#resizing to see data better
plt.figure(figsize=(15, 12))
#plotting
for borough in boroughs:
plt.plot(filtered_df[date_column], filtered_df[borough], label=borough)
#adding title and naming axis
plt.title('COVID Cases vs. Time by Borough')
plt.xlabel('Date')
plt.ylabel('Case Count')
plt.legend()
plt.grid(True)
plt.show()
In future analysis, we may also attempt to use other features of the rat data to predict where you can expect to see the most rats. We will use KNN analysis to do so in a future milestone. Another question we want to answer is whether the numeric features of the data are correlated with one another, which the heatmap below begins to explore.
import seaborn as sns
#numeric_only=True avoids a FutureWarning and restricts corr to numeric columns
correlation_matrix = rat_df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
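Looking ahead to that KNN milestone, here is a minimal 1-nearest-neighbor sketch. The handful of (latitude, longitude) points and the `nearest_borough` helper are illustrative stand-ins, not part of our pipeline; the real version would draw its points from `rat_df[['Latitude', 'Longitude', 'Borough']]` and use more neighbors:

```python
import math

# Illustrative 1-nearest-neighbor sketch: coordinates are a few sample
# sightings (one per borough), not the full dataset.
sightings = [
    ((40.703, -73.800), 'QUEENS'),
    ((40.608, -73.978), 'BROOKLYN'),
    ((40.866, -73.919), 'MANHATTAN'),
]

def nearest_borough(lat, lon):
    # Euclidean distance on raw degrees is a rough proxy at city scale
    return min(sightings, key=lambda s: math.dist((lat, lon), s[0]))[1]

# Classify a new location by its closest known sighting
print(nearest_borough(40.71, -73.81))
```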